1 Introduction

As per Kaggle Website:

Ask a home buyer to describe their dream house, and they probably won’t begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition’s dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.

With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.

Throughout this analysis we will perform feature engineering on predictors one by one in order to try to form the most accurate prediction for house prices. By the end of this analysis the models of random forest, lasso, and cubist will be performed in order to make the final predictions. Of the three, cubist ultimately performed the best and was chosen as the final submission.

2 Libraries

library(ggplot2)
library(plyr,include.only = "revalue")
library(dplyr)
library(caret)
library(gridExtra)
library(e1071) #naive bayes
library(corrplot)
library(Metrics)
library(earth)
library(knitr)
library(gt)
options(scipen=999)

3 Data

There will be two data sets used in this analysis, the train data set and the test data set. The train data set contains 1460 observations and 81 variables and will be used for the data analysis and model building. The test data set will only used in the end to make the final predictions and contains 1459 rows and 80 variables, with the missing variable being the sale price which we are trying to predict.

Train Data Set:

train <- read.csv("train.csv",stringsAsFactors = FALSE)
gt(head(train))
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
1 60 RL 65 8450 Pave NA Reg Lvl AllPub Inside Gtl CollgCr Norm Norm 1Fam 2Story 7 5 2003 2003 Gable CompShg VinylSd VinylSd BrkFace 196 Gd TA PConc Gd TA No GLQ 706 Unf 0 150 856 GasA Ex Y SBrkr 856 854 0 1710 1 0 2 1 3 1 Gd 8 Typ 0 NA Attchd 2003 RFn 2 548 TA TA Y 0 61 0 0 0 0 NA NA NA 0 2 2008 WD Normal 208500
2 20 RL 80 9600 Pave NA Reg Lvl AllPub FR2 Gtl Veenker Feedr Norm 1Fam 1Story 6 8 1976 1976 Gable CompShg MetalSd MetalSd None 0 TA TA CBlock Gd TA Gd ALQ 978 Unf 0 284 1262 GasA Ex Y SBrkr 1262 0 0 1262 0 1 2 0 3 1 TA 6 Typ 1 TA Attchd 1976 RFn 2 460 TA TA Y 298 0 0 0 0 0 NA NA NA 0 5 2007 WD Normal 181500
3 60 RL 68 11250 Pave NA IR1 Lvl AllPub Inside Gtl CollgCr Norm Norm 1Fam 2Story 7 5 2001 2002 Gable CompShg VinylSd VinylSd BrkFace 162 Gd TA PConc Gd TA Mn GLQ 486 Unf 0 434 920 GasA Ex Y SBrkr 920 866 0 1786 1 0 2 1 3 1 Gd 6 Typ 1 TA Attchd 2001 RFn 2 608 TA TA Y 0 42 0 0 0 0 NA NA NA 0 9 2008 WD Normal 223500
4 70 RL 60 9550 Pave NA IR1 Lvl AllPub Corner Gtl Crawfor Norm Norm 1Fam 2Story 7 5 1915 1970 Gable CompShg Wd Sdng Wd Shng None 0 TA TA BrkTil TA Gd No ALQ 216 Unf 0 540 756 GasA Gd Y SBrkr 961 756 0 1717 1 0 1 0 3 1 Gd 7 Typ 1 Gd Detchd 1998 Unf 3 642 TA TA Y 0 35 272 0 0 0 NA NA NA 0 2 2006 WD Abnorml 140000
5 60 RL 84 14260 Pave NA IR1 Lvl AllPub FR2 Gtl NoRidge Norm Norm 1Fam 2Story 8 5 2000 2000 Gable CompShg VinylSd VinylSd BrkFace 350 Gd TA PConc Gd TA Av GLQ 655 Unf 0 490 1145 GasA Ex Y SBrkr 1145 1053 0 2198 1 0 2 1 4 1 Gd 9 Typ 1 TA Attchd 2000 RFn 3 836 TA TA Y 192 84 0 0 0 0 NA NA NA 0 12 2008 WD Normal 250000
6 50 RL 85 14115 Pave NA IR1 Lvl AllPub Inside Gtl Mitchel Norm Norm 1Fam 1.5Fin 5 5 1993 1995 Gable CompShg VinylSd VinylSd None 0 TA TA Wood Gd TA No GLQ 732 Unf 0 64 796 GasA Ex Y SBrkr 796 566 0 1362 1 0 1 1 1 1 TA 5 Typ 0 NA Attchd 1993 Unf 2 480 TA TA Y 40 30 0 320 0 0 NA MnPrv Shed 700 10 2009 WD Normal 143000

Test Data Set:

test <- read.csv("test.csv",stringsAsFactors = FALSE)

gt(head(test))
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition
1461 20 RH 80 11622 Pave NA Reg Lvl AllPub Inside Gtl NAmes Feedr Norm 1Fam 1Story 5 6 1961 1961 Gable CompShg VinylSd VinylSd None 0 TA TA CBlock TA TA No Rec 468 LwQ 144 270 882 GasA TA Y SBrkr 896 0 0 896 0 0 1 0 2 1 TA 5 Typ 0 NA Attchd 1961 Unf 1 730 TA TA Y 140 0 0 0 120 0 NA MnPrv NA 0 6 2010 WD Normal
1462 20 RL 81 14267 Pave NA IR1 Lvl AllPub Corner Gtl NAmes Norm Norm 1Fam 1Story 6 6 1958 1958 Hip CompShg Wd Sdng Wd Sdng BrkFace 108 TA TA CBlock TA TA No ALQ 923 Unf 0 406 1329 GasA TA Y SBrkr 1329 0 0 1329 0 0 1 1 3 1 Gd 6 Typ 0 NA Attchd 1958 Unf 1 312 TA TA Y 393 36 0 0 0 0 NA NA Gar2 12500 6 2010 WD Normal
1463 60 RL 74 13830 Pave NA IR1 Lvl AllPub Inside Gtl Gilbert Norm Norm 1Fam 2Story 5 5 1997 1998 Gable CompShg VinylSd VinylSd None 0 TA TA PConc Gd TA No GLQ 791 Unf 0 137 928 GasA Gd Y SBrkr 928 701 0 1629 0 0 2 1 3 1 TA 6 Typ 1 TA Attchd 1997 Fin 2 482 TA TA Y 212 34 0 0 0 0 NA MnPrv NA 0 3 2010 WD Normal
1464 60 RL 78 9978 Pave NA IR1 Lvl AllPub Inside Gtl Gilbert Norm Norm 1Fam 2Story 6 6 1998 1998 Gable CompShg VinylSd VinylSd BrkFace 20 TA TA PConc TA TA No GLQ 602 Unf 0 324 926 GasA Ex Y SBrkr 926 678 0 1604 0 0 2 1 3 1 Gd 7 Typ 1 Gd Attchd 1998 Fin 2 470 TA TA Y 360 36 0 0 0 0 NA NA NA 0 6 2010 WD Normal
1465 120 RL 43 5005 Pave NA IR1 HLS AllPub Inside Gtl StoneBr Norm Norm TwnhsE 1Story 8 5 1992 1992 Gable CompShg HdBoard HdBoard None 0 Gd TA PConc Gd TA No ALQ 263 Unf 0 1017 1280 GasA Ex Y SBrkr 1280 0 0 1280 0 0 2 0 2 1 Gd 5 Typ 0 NA Attchd 1992 RFn 2 506 TA TA Y 0 82 0 0 144 0 NA NA NA 0 1 2010 WD Normal
1466 60 RL 75 10000 Pave NA IR1 Lvl AllPub Corner Gtl Gilbert Norm Norm 1Fam 2Story 6 5 1993 1994 Gable CompShg HdBoard HdBoard None 0 TA TA PConc Gd TA No Unf 0 Unf 0 763 763 GasA Gd Y SBrkr 763 892 0 1655 0 0 2 1 3 1 TA 7 Typ 1 TA Attchd 1993 Fin 2 440 TA TA Y 157 84 0 0 0 0 NA NA NA 0 4 2010 WD Normal

4 NA Values

There exist many variables with at least one missing value in train set.

naDF <- data.frame(sapply(train,function(x) sum(is.na(x))))
names(naDF) <- "NA_Count"
naDF$variable <- row.names(naDF)
naDF <- naDF %>% 
  filter(NA_Count>0) %>% 
  arrange(desc(NA_Count))

ggplot(naDF,aes(x=reorder(variable,NA_Count),y=NA_Count,fill=variable))+
  geom_col()+
  geom_text(aes(label=NA_Count), hjust=0)+
  coord_flip()+
  theme(legend.position = "none")+
  labs(title="NA Counts of Train Data",
       subtitle = "With at least 1 NA")+
  ylab("NA Count")+
  xlab("Train Variables")+
  ylim(0,1500)

4.1 Known NA Values

Most of the variables that contain NA values are variables where the NA value is known. For example, the Alley variable replaces the level of “none” with NA and the Basement Quality variable replaces the level of “no basement” with NA.

naDF <- naDF %>% 
  mutate(knownMissing=ifelse(variable %in% c("Alley","BsmtQual","BsmtCond","BsmtExposure","BsmtFinType1","BsmtFinType2","FireplaceQu","GarageType","GarageFinish","GarageQual","GarageCond","PoolQC","Fence","MiscFeature"),"Known","Unknown"))

ggplot(naDF,aes(x=as.factor(knownMissing),fill=knownMissing))+
  geom_bar()+
  geom_text(stat="count",aes(label=..count..), vjust=-1)+
  xlab("Known vs Unknown NA's")+
  ylab("Count")+
  labs(title="Count of Variables with Known vs Unknown NA values")+
  guides(fill=guide_legend(title="Known vs Unknown"))+
  ylim(0,15)

The fourteen variables that have known missing values are Alley, Basement Quality, Basement Condition, Basement Exposure, Basement Finish Type1, Basement Finish Type2, Fire Place Quality, Garage Type, Garage Finish, Garage Quality, Garage Condition, Pool Quality, Fence, and Miscellaneous. The appropriate values will all be added to the train data now to replace the known NA values.

train[is.na(train$Alley),"Alley"] <- "none"

for(name in c("BsmtQual","BsmtCond","BsmtExposure","BsmtFinType1","BsmtFinType2")){
  train[is.na(train[,name]),name] <- "none"
}

train[is.na(train$FireplaceQu),"FireplaceQu"] <- "none"

for(name in c("GarageType","GarageFinish","GarageQual","GarageCond")){
  train[is.na(train[,name]),name] <- "none"
}

train[is.na(train$PoolQC),"PoolQC"] <- "none"

train[is.na(train$Fence),"Fence"] <- "none"

train[is.na(train$MiscFeature),"MiscFeature"] <- "none"

The same will be done for the test set.

test[is.na(test$Alley),"Alley"] <- "none"

for(name in c("BsmtQual","BsmtCond","BsmtExposure","BsmtFinType1","BsmtFinType2")){
  test[is.na(test[,name]),name] <- "none"
}

test[is.na(test$FireplaceQu),"FireplaceQu"] <- "none"

for(name in c("GarageType","GarageFinish","GarageQual","GarageCond")){
  test[is.na(test[,name]),name] <- "none"
}

test[is.na(test$PoolQC),"PoolQC"] <- "none"

test[is.na(test$Fence),"Fence"] <- "none"

test[is.na(test$MiscFeature),"MiscFeature"] <- "none"

4.2 Unknown NA Values

Only five variables have remaining NA values. These values will be calulated in the feature engineering section.

#creating na table
naDF <- data.frame(sapply(train,function(x) sum(is.na(x))))
names(naDF) <- "NA_Count"
naDF$variable <- row.names(naDF)
naDF <- naDF %>% 
  select(variable,NA_Count) %>% 
  filter(NA_Count>0)

ggplot(naDF,aes(x=variable,y=NA_Count,fill=variable))+
  geom_col(aes(x=reorder(variable,desc(NA_Count))))+
  geom_text(aes(label=NA_Count), vjust=-1)+
  xlab("Variable")+
  ylab("NA Count")+
  labs(title="Remaining NA Values")+
  theme(legend.position = "none")+
  ylim(0,275)

5 Correlation Plots

There exists strong correlations with some of the predictors with correlation values above .8.

numericPredictors <- train[complete.cases(train),c(-1,-81)] %>% select_if(is.numeric) #1 = ID 81=SalesPrice

correlations <- as.data.frame(as.table(cor(numericPredictors)))

correlations <- correlations %>% filter(Freq!=1) %>% arrange(desc(Freq)) %>% rename(correlation=Freq)

gt(correlations[seq(1,nrow(correlations),by=2),] %>% head(10))
Var1 Var2 correlation
GarageArea GarageCars 0.8396258
X1stFlrSF TotalBsmtSF 0.8358336
TotRmsAbvGrd GrLivArea 0.8244260
GarageYrBlt YearBuilt 0.8233292
GrLivArea X2ndFlrSF 0.6885976
BsmtFullBath BsmtFinSF1 0.6514960
TotRmsAbvGrd BedroomAbvGr 0.6502724
GarageYrBlt YearRemodAdd 0.6454552
YearRemodAdd YearBuilt 0.6228006
TotRmsAbvGrd X2ndFlrSF 0.6177579
corrplot(cor(numericPredictors))

One of the predictors from each pair of correlated predictors with a correlation value above .8 will be dropped. Since garage year built has eighty-one NA values it will be dropped from that pair. The choice of which one to drop from the remaining three pairs will be based off which predictor has a lower correlation with the dependent variable of Sales Price. The predictors that will be dropped are total rooms above ground, first floor sf, and garage area.

numericPredictors <- cbind(SalePrice=train[complete.cases(train),"SalePrice"],numericPredictors)

correlationWithSalePrice <- as.data.frame(sapply(numericPredictors,function(x) cor(x,numericPredictors$SalePrice,use="complete.obs")))

names(correlationWithSalePrice) <- "correlation"
correlationWithSalePrice$predictor <- row.names(correlationWithSalePrice)

correlationWithSalePrice %>% filter(predictor %in% c("GarageArea","GarageCars","X1stFlrSF","TotalBsmtSF","TotRmsAbvGrd","GrLivArea")) %>% arrange(desc(correlation))
##              correlation    predictor
## GrLivArea      0.7051392    GrLivArea
## GarageCars     0.6470931   GarageCars
## GarageArea     0.6193185   GarageArea
## TotalBsmtSF    0.6159898  TotalBsmtSF
## X1stFlrSF      0.6080921    X1stFlrSF
## TotRmsAbvGrd   0.5471478 TotRmsAbvGrd

6 Data Analysis

6.1 ggplot Custom Functions

ggplot function for continuous data:

#train with complete observations only

numericggplot <- function(variable){
  histo <- ggplot(train,aes(x=train[,variable]))+
    geom_histogram()+
    xlab(variable)+
    theme(axis.text=element_text(size=14),
          axis.title=element_text(size=14,face="bold"))+
    labs(title=variable)

  
  scattero <- ggplot(train,aes(x=train[,variable],y=SalePrice))+
    geom_point()+
    geom_smooth(method = "lm", se=FALSE, color="black")+
    xlab(variable)+
    ylab("SalePrice")+
    theme(axis.text=element_text(size=14),
          axis.title=element_text(size=14,face="bold"))+
    labs(title=variable)
  
grid.arrange(histo,scattero,ncol=2)
}

ggplot function for discrete data:

discreteggplot <- function(name){
    discretePlot <- ggplot(train,aes(x=as.factor(train[,name])))+
    geom_bar()+xlab(name)

discreteBox <- ggplot(train,aes(x=as.factor(train[,name]),y=SalePrice))+
    geom_boxplot()+xlab(name)
  
grid.arrange(discretePlot,discreteBox,ncol=2)
}

6.2 Dependent Variable

The dependent variable we are trying to predict is SalePrice. SalePrice represents the prices of houses in Ames, Iowa. It exists in the train data set, but not in the test data set. It is skewed to the right with some very high trailing values.

salePriceHisto <- ggplot(train,aes(x=SalePrice))+
  geom_histogram()+
  labs(title="SalePrice")

salePriceBox <- ggplot(train,aes(x=SalePrice))+
  geom_boxplot()+
  labs(title="SalePrice")

grid.arrange(salePriceHisto,salePriceBox,ncol=2)

By applying the log transformation the sale price becomes normally distributed.

salePriceHistoLog <- ggplot(train,aes(x=log(SalePrice)))+
  geom_histogram()+
  labs(title="SalePrice")+
  xlab("log(SalePrice)")

salePriceBoxLog <- ggplot(train,aes(x=log(SalePrice)))+
  geom_boxplot()+
  labs(title="SalePrice")+
  xlab("log(SalePrice)")

grid.arrange(salePriceHistoLog,salePriceBoxLog,ncol=2)

6.3 Numeric Predictors

6.3.1 Lot Frontage and Area

LotFrontage: Linear feet of street connected to a property. There appears to be two outliers with lot frontage above 300 including observations 935 and 1299. Due to the high number of missing values more sophisticated methods of imputation will have to be used.

numericggplot("LotFrontage")

kable(train %>% filter(LotFrontage>300))
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
935 20 RL 313 27650 Pave none IR2 HLS AllPub Inside Mod NAmes PosA Norm 1Fam 1Story 7 7 1960 2007 Flat Tar&Grv Wd Sdng Wd Sdng None 0 TA TA CBlock Gd TA Gd GLQ 425 Unf 0 160 585 GasA Ex Y SBrkr 2069 0 0 2069 1 0 2 0 4 1 Gd 9 Typ 1 Gd Attchd 1960 RFn 2 505 TA TA Y 0 0 0 0 0 0 none none none 0 11 2008 WD Normal 242000
1299 60 RL 313 63887 Pave none IR3 Bnk AllPub Corner Gtl Edwards Feedr Norm 1Fam 2Story 10 5 2008 2008 Hip ClyTile Stucco Stucco Stone 796 Ex TA PConc Ex TA Gd GLQ 5644 Unf 0 466 6110 GasA Ex Y SBrkr 4692 950 0 5642 2 0 2 1 3 1 Ex 12 Typ 3 Gd Attchd 2008 Fin 2 1418 TA TA Y 214 292 0 0 0 480 Gd none none 0 1 2008 New Partial 160000

LotArea: Lot size in square feet. Contains only four values above 100,000 including observations 250, 314, 336, and 707.

numericggplot("LotArea")

kable(train[train$LotArea>100000,])
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
250 250 50 RL NA 159000 Pave none IR2 Low AllPub CulDSac Sev ClearCr Norm Norm 1Fam 1.5Fin 6 7 1958 2006 Gable CompShg Wd Sdng HdBoard BrkCmn 472 Gd TA CBlock Gd TA Gd Rec 697 Unf 0 747 1444 GasA Gd Y SBrkr 1444 700 0 2144 0 1 2 0 4 1 Gd 7 Typ 2 TA Attchd 1958 Fin 2 389 TA TA Y 0 98 0 0 0 0 none none Shed 500 6 2007 WD Normal 277000
314 314 20 RL 150 215245 Pave none IR3 Low AllPub Inside Sev Timber Norm Norm 1Fam 1Story 7 5 1965 1965 Hip CompShg BrkFace BrkFace None 0 TA TA CBlock Gd TA Gd ALQ 1236 Rec 820 80 2136 GasW TA Y SBrkr 2036 0 0 2036 2 0 2 0 3 1 TA 8 Typ 2 Gd Attchd 1965 RFn 2 513 TA TA Y 0 0 0 0 0 0 none none none 0 6 2009 WD Normal 375000
336 336 190 RL NA 164660 Grvl none IR1 HLS AllPub Corner Sev Timber Norm Norm 2fmCon 1.5Fin 5 6 1965 1965 Gable CompShg Plywood Plywood None 0 TA TA CBlock TA TA Gd ALQ 1249 BLQ 147 103 1499 GasA Ex Y SBrkr 1619 167 0 1786 2 0 2 0 3 1 TA 7 Typ 2 Gd Attchd 1965 Fin 2 529 TA TA Y 670 0 0 0 0 0 none none Shed 700 8 2008 WD Normal 228950
707 707 20 RL NA 115149 Pave none IR2 Low AllPub CulDSac Sev ClearCr Norm Norm 1Fam 1Story 7 5 1971 2002 Gable CompShg Plywood Plywood Stone 351 TA TA CBlock Gd TA Gd GLQ 1219 Unf 0 424 1643 GasA TA Y SBrkr 1824 0 0 1824 1 0 2 0 2 1 Gd 5 Typ 2 TA Attchd 1971 Unf 2 739 TA TA Y 380 48 0 0 0 0 none none none 0 6 2007 WD Normal 302000

\(\textbf{Pick the next tab in order to see other variables.}\)

6.3.2 Misc Value

MiscVal: Value of miscellaneous features. Most have a value of zero inferring no miscellaneous feature. Given that the value is not zero, the average miscellaneous value is 1221.

numericggplot("MiscVal")

miscValTable <- as.data.frame(table(train$MiscVal))
names(miscValTable) <- c("value","frequency")
kable(miscValTable)
value frequency
0 1408
54 1
350 1
400 11
450 4
480 2
500 8
560 1
600 4
620 1
700 5
800 1
1150 1
1200 2
1300 1
1400 1
2000 4
2500 1
3500 1
8300 1
15500 1
#average price of misc. value given not 0
round(mean(train[!is.na(train$MiscVal) & train$MiscVal!=0,"MiscVal"]),1)
## [1] 1221

\(\textbf{Pick the next tab in order to see other variables.}\)

6.3.3 Year Built, Year Remodeled, Garage Year Built

YearBuilt: Original construction date. The newer the house, the higher the house prices are. However, there are a couple houses built before 1900 that are very expensize. Investigating we can see that all four of these houses had a recent remodel. A predictor will be added for houses that have a remodel year that is not equal to the year the house was built.

numericggplot("YearBuilt")

kable(train[train$SalePrice>200000 & train$YearBuilt<1900 & !is.na(train$SalePrice),c("YearBuilt","YearRemodAdd","SalePrice")])
YearBuilt YearRemodAdd SalePrice
186 1892 1993 475000
305 1880 2002 295000
584 1893 2000 325000
748 1880 2003 265979

YearRemodAdd: Remodel date (same as construction date if no remodeling or additions). Fifty-two percent of houses did not have a remodel.

numericggplot("YearRemodAdd")

print(paste(round(sum(train$YearRemodAdd==train$YearBuilt)/nrow(train)*100,0),"% of houses did not have a remodel."))
## [1] "52 % of houses did not have a remodel."

GarageYrBlt: Year garage was built. 75% of the time the garage was built when the house was built. There will be an added feature later on to depict whether or not the garage was added after the house was built. All eighty-one of the NA values in garage year built are due to the fact that the house does not have a garage.

numericggplot("GarageYrBlt")

#percentage of time garage was built when house was built
round(sum(train$GarageYrBlt==train$YearBuilt,na.rm=TRUE)/nrow(train),2)
## [1] 0.75
#comparing garage year built with house year built
ggplot(train,aes(x=GarageYrBlt-YearBuilt))+
  geom_histogram()+
  xlab("Difference in Time")+
  labs(title="Difference Between Year House was Built and Year Garage was Built")

table(train[is.na(train$GarageYrBlt),"GarageYrBlt"],train[is.na(train$GarageYrBlt),"GarageFinish"],exclude=FALSE)
##       
##        none
##   <NA>   81

\(\textbf{Pick the next tab in order to see other variables.}\)

6.3.4 Garage Cars and Area

GarageCars: Size of garage in car capacity.

discreteggplot("GarageCars")

GarageArea: Size of garage in square feet.

There exists a strong correlation between garage cars and garage area which makes sense. One of the predictors will likely be dropped later on.

numericggplot("GarageArea")

ggplot(train,aes(x=as.factor(GarageCars),y=GarageArea))+
  geom_boxplot()+
  xlab("Number of Cars in Garage")+
  labs(title="Garage Area vs Garage Cars",
       subtitle=paste("Correlation:",round(cor(train$GarageCars,train$GarageArea),2)))

\(\textbf{Pick the next tab in order to see other variables.}\)

6.3.5 MasVnrArea and MasVnrType

MasVnrArea: Masonry veneer area in square feet. Most of the values are zero.

numericggplot("MasVnrArea")

MasVnrType: Masonry veneer type. There are two observations that have a masonry vaneer type listed, but have a masonry veneer area of zero. There are also five cases where the Masonry Vaneer Type is none but the Masonry Vaneer Area is not zero. These will be addressed in feature engineering section by replacing these values with their median.

discreteggplot("MasVnrType")

kable(train %>% filter(MasVnrArea==0 & !MasVnrType=="None") %>% select(Id,MasVnrArea,MasVnrType))
Id MasVnrArea MasVnrType
689 0 BrkFace
1242 0 Stone
train[complete.cases(train$MasVnrType),] %>% group_by(MasVnrType) %>% summarize(medianMasVnrArea=median(MasVnrArea,na.rm=TRUE))
## # A tibble: 4 × 2
##   MasVnrType medianMasVnrArea
##   <chr>                 <dbl>
## 1 BrkCmn                 192 
## 2 BrkFace                202 
## 3 None                     0 
## 4 Stone                  206.
kable(train %>% filter(MasVnrArea!=0 & MasVnrType=="None") %>% select(Id,MasVnrArea,MasVnrType))
Id MasVnrArea MasVnrType
625 288 None
774 1 None
1231 1 None
1301 344 None
1335 312 None

\(\textbf{Pick the next tab in order to see other variables.}\)

6.3.6 Basement Area Predictors

BsmtFinSF1: Type 1 finish square feet. Observation 1299 is an outlier with the basement type1 square feet being above 3000.

numericggplot("BsmtFinSF1")

kable(train %>% filter(BsmtFinSF1>3000) %>% select(Id,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF))
Id BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF
1299 5644 0 466 6110

BsmtFinSF2: Type 2 finished square feet.

numericggplot("BsmtFinSF2")

BsmtUnfSF: Unfinished square feet of basement area.

numericggplot("BsmtUnfSF")

TotalBsmtSF: Total square feet of basement area. Is the same as the sum of finished square feet 1, finished square feet 2, and unfinished square feet. There is one outlier with a total basement square feet above 6000 which is observation 1299. However, the sum of finished square feet 1, finished square feet 2, and unfinished square feet is equal to the total sum of square feet basement area for each observation meaning no typo errors.

numericggplot("TotalBsmtSF")

sum(train$TotalBsmtSF!=train$BsmtFinSF1+train$BsmtFinSF2+train$BsmtUnfSF)
## [1] 0

\(\textbf{Pick the next tab in order to see other variables.}\)

6.3.7 Upstairs SF Area

X1stFlrSF: First Floor square feet.

numericggplot("X1stFlrSF")

X2ndFlrSF: Second floor square feet.

numericggplot("X2ndFlrSF")

LowQualFinSF: Low quality finished square feet (all floors). Only twenty-six houses in the train data set have low quality finished square feet.

numericggplot("LowQualFinSF")

train %>% filter(LowQualFinSF>0) %>% count()
##    n
## 1 26

GrLivArea: Above grade (ground) living area square feet.

First floor square feet + second floor square feet + low quality square feet = above ground living area for each observation. Observation 1299 comes up again as an outlier, will likely be dropped in feature engineering stage.

numericggplot("GrLivArea")

sum(train$X1stFlrSF+train$X2ndFlrSF+train$LowQualFinSF!=train$GrLivArea)/nrow(train)
## [1] 0
kable(train[train$GrLivArea>5000,c("Id","SalePrice","GrLivArea","LowQualFinSF","X2ndFlrSF","X1stFlrSF")])
Id SalePrice GrLivArea LowQualFinSF X2ndFlrSF X1stFlrSF
1299 1299 160000 5642 0 950 4692

\(\textbf{Pick the next tab in order to see other variables.}\)

6.3.8 Number of Bathrooms

BsmtFullBath: Basement full bathrooms. Only one house has three full bathrooms in the basement. Although it appears unlikely for a basement of that size to have three full bathrooms, it is not entirely impossible. We can analyze this from the box-plot where a basement of a similar size has two full basement bathrooms.

discreteggplot("BsmtFullBath")

ggplot(train,aes(x=as.factor(BsmtFullBath),y=TotalBsmtSF))+
  geom_boxplot()+
  xlab("Number of Full Basement Bathrooms")+
  labs(title="Total Basement sq.ft vs Number of Basement Full Bathrooms")

BsmtHalfBath: Basement half bathrooms.

discreteggplot("BsmtHalfBath")

FullBath: Full bathrooms above grade.

discreteggplot("FullBath")

HalfBath: Half baths above grade.

discreteggplot("HalfBath")

There is no variable for the total bathrooms in a house. This will be added later on in the feature engineering section.

train %>% 
  mutate(totalBathrooms=BsmtFullBath+BsmtHalfBath+HalfBath+FullBath,na.rm=TRUE) %>% 
  ggplot(.,aes(x=as.factor(totalBathrooms)))+
  geom_bar()+
  xlab("Total Bathooms")+
  ylab("Count of Bathrooms")+
  labs(title="Total Bathrooms")

\(\textbf{Pick the next tab in order to see other variables.}\)

6.3.9 Bedrooms, Kitchens, and Total Rooms

BedroomAbvGr: Bedrooms above ground (does NOT include basement bedrooms).

discreteggplot("BedroomAbvGr")

KitchenAbvGr: Kitchens above ground.

discreteggplot("KitchenAbvGr")

TotRmsAbvGrd: Total rooms above ground (does not include bathrooms).

discreteggplot("TotRmsAbvGrd")

There will be a feature added later on for rooms that are neither kitchen or bedrooms.

train %>% mutate(otherRooms=TotRmsAbvGrd-BedroomAbvGr-KitchenAbvGr) %>% 
  ggplot(.,aes(x=otherRooms))+
  geom_bar()+xlab("Other Rooms")

\(\textbf{Pick the next tab in order to see other variables.}\)

6.3.10 Fireplaces

Fireplaces: Number of fireplaces. Nearly all houses have either one or two fireplaces.

discreteggplot("Fireplaces")

\(\textbf{Pick the next tab in order to see other variables.}\)

6.3.11 Porch Variables

OpenPorchSF: Open porch area in square feet.

numericggplot("OpenPorchSF")

EnclosedPorch: Enclosed porch area in square feet.

numericggplot("EnclosedPorch")

X3SsnPorch: Three season porch area in square feet.

numericggplot("X3SsnPorch")

ScreenPorch: Screen porch area in square feet.

numericggplot("ScreenPorch")

A predictor will be made later for whether or not a house has a porch. Most houses have a porch.

train %>% mutate(hasPorch=ifelse(ScreenPorch>0 | X3SsnPorch>0 | EnclosedPorch>0 | OpenPorchSF>0,1,0)) %>% 
  ggplot(.,aes(x=as.factor(hasPorch)))+
  geom_bar()+
  xlab("Has a Porch")

Could also add a predictor for the total number of porch square feet.

train %>% mutate(totalPorchArea=ScreenPorch+X3SsnPorch+EnclosedPorch+OpenPorchSF) %>% 
  ggplot(.,aes(x=totalPorchArea))+
  geom_histogram()

WoodDeckSF: Wood deck area in square feet.

numericggplot("WoodDeckSF")

\(\textbf{Pick the next tab in order to see other variables.}\)

6.3.12 Pool Area and Quality

PoolArea: Pool area in square feet.

numericggplot("PoolArea")

PoolQC: Quality of the pool.

discreteggplot("PoolQC")

6.4 Date/Time Predictors

MoSold: Month Sold (MM).

Most houses seem to sell in the summer months, but the average house price per month sold seems pretty even.

moSold1 <- ggplot(train,aes(x=as.character(MoSold)))+
  geom_bar()+
  scale_x_discrete(limits=seq(1,12))+
  xlab("Months")

moSold2 <- ggplot(train[!is.na(train$SalePrice),],aes(x=as.character(MoSold),y=SalePrice))+
  geom_boxplot()+
  scale_x_discrete(limits=seq(1,12))+
  xlab("Months")

grid.arrange(moSold1,moSold2,ncol=2)

YrSold: Year Sold (YYYY).

From the previous charts we can see the end of the range of observations is in August 2010, which explains why 2010 has the least number of houses sold. There appears to be a gradual decrease in the median price of a house sold from 2006 to 2010. Both month sold and year sold will be converted to character predictors as there is no apparent ordering of sale price within them.

discreteggplot("YrSold")

6.5 Ordered Predictors

6.5.1 Overall Quality

OverallQual: Rates the overall material and finish of the house

  • 10 Very Excellent
  • 9 Excellent
  • 8 Very Good
  • 7 Good
  • 6 Above Average
  • 5 Average
  • 4 Below Average
  • 3 Fair
  • 2 Poor
  • 1 Very Poor

There is a clear increase in the sale prices as the quality of the house increases.

overallqualBar <- ggplot(train,aes(x=OverallQual))+
  geom_bar()+
  scale_x_discrete(limits=seq(1,10))+
  xlab("Overall Quality")

overallqualBox <- ggplot(train,aes(x=as.factor(OverallQual),y=SalePrice))+
  geom_boxplot()+
  scale_x_discrete(limits=seq(1,10))+
  xlab("Overall Quality")

grid.arrange(overallqualBar,overallqualBox,ncol=2)

\(\textbf{Pick the next tab in order to see other variables.}\)

6.5.2 Overall Condition

OverallCond: Rates the overall condition of the house.

  • 10 Very Excellent
  • 9 Excellent
  • 8 Very Good
  • 7 Good
  • 6 Above Average
  • 5 Average
  • 4 Below Average
  • 3 Fair
  • 2 Poor
  • 1 Very Poor

There is one outlier for the sale price when the overall condition is 2 and another outlier when the overall condition is 6. These observations are 379 and 692.

ggplot(train,aes(x=as.character(OverallCond),y=SalePrice))+
  geom_boxplot()+
  scale_x_discrete(limits=seq(1,10))+
  xlab("Overall Condition")

kable(train[train$OverallCond==2 & train$SalePrice>350000 & !is.na(train$SalePrice),])
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
379 379 20 RL 88 11394 Pave none Reg Lvl AllPub Corner Gtl StoneBr Norm Norm 1Fam 1Story 9 2 2010 2010 Hip CompShg VinylSd VinylSd Stone 350 Gd TA PConc Ex TA Av GLQ 1445 Unf 0 411 1856 GasA Ex Y SBrkr 1856 0 0 1856 1 0 1 1 1 1 Ex 8 Typ 1 Ex Attchd 2010 Fin 3 834 TA TA Y 113 0 0 0 0 0 none none none 0 6 2010 New Partial 394432
kable(train[train$OverallCond==6 & train$SalePrice>600000 & !is.na(train$SalePrice),])
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
692 692 60 RL 104 21535 Pave none IR1 Lvl AllPub Corner Gtl NoRidge Norm Norm 1Fam 2Story 10 6 1994 1995 Gable WdShngl HdBoard HdBoard BrkFace 1170 Ex TA PConc Ex TA Gd GLQ 1455 Unf 0 989 2444 GasA Ex Y SBrkr 2444 1872 0 4316 0 1 3 1 4 1 Ex 10 Typ 2 Ex Attchd 1994 Fin 3 832 TA TA Y 382 50 0 0 0 0 none none none 0 1 2007 WD Normal 755000

\(\textbf{Pick the next tab in order to see other variables.}\)

6.5.3 External Quality and Condition

ExterQual: Evaluates the quality of the material on the exterior.

  • Ex Excellent
  • Gd Good
  • TA Average/Typical
  • Fa Fair
  • Po Poor
discreteggplot("ExterQual")

ExterCond: Evaluates the present condition of the material on the exterior. External quality and external condition will both be made into ordered predictors.

  • Ex Excellent
  • Gd Good
  • TA Average/Typical
  • Fa Fair
  • Po Poor
discreteggplot("ExterCond")

\(\textbf{Pick the next tab in order to see other variables.}\)

6.5.4 Basement Quality and Condition

BsmtQual: Evaluates the height of the basement.

  • Ex Excellent (100+ inches)
  • Gd Good (90-99 inches)
  • TA Typical (80-89 inches)
  • Fa Fair (70-79 inches)
  • Po Poor (<70 inches
  • NA No Basement
discreteggplot("BsmtQual")

BsmtCond: Evaluates the general condition of the basement. Both basement quality and basement condition will be made into ordered predictors.

  • Ex Excellent
  • Gd Good
  • TA Typical - slight dampness allowed
  • Fa Fair - dampness or some cracking or settling
  • Po Poor - Severe cracking, settling, or wetness
  • NA No Basement
discreteggplot("BsmtCond")

\(\textbf{Pick the next tab in order to see other variables.}\)

6.5.5 Kitchen Quality

KitchenQual: Kitchen quality. Will be made into ordered predictor.

  • Ex Excellent
  • Gd Good
  • TA Typical/Average
  • Fa Fair
  • Po Poor
discreteggplot("KitchenQual")

\(\textbf{Pick the next tab in order to see other variables.}\)

6.5.6 Fireplace Quality

FireplaceQu: Fireplace quality.

  • Ex Excellent - Exceptional Masonry Fireplace
  • Gd Good - Masonry Fireplace in main level
  • TA Average - Prefabricated Fireplace in main living area or Masonry Fireplace in basement
  • Fa Fair - Prefabricated Fireplace in basement
  • Po Poor - Ben Franklin Stove
  • NA No Fireplace
discreteggplot("FireplaceQu")

\(\textbf{Pick the next tab in order to see other variables.}\)

6.5.7 Garage Quality and Condition

GarageQual: Garage quality.

  • Ex Excellent
  • Gd Good
  • TA Typical/Average
  • Fa Fair
  • Po Poor
  • NA No Garage
discreteggplot("GarageQual")

GarageCond: Garage condition. Both garage quality and condition and will be made into ordered predictors.

  • Ex Excellent
  • Gd Good
  • TA Typical/Average
  • Fa Fair
  • Po Poor
  • NA No Garage
discreteggplot("GarageCond")

\(\textbf{Pick the next tab in order to see other variables.}\)

6.5.8 Pool Quality

PoolQC: Pool quality.

  • Ex Excellent
  • Gd Good
  • TA Average/Typical
  • Fa Fair
  • NA No Pool
discreteggplot("PoolQC")

\(\textbf{Pick the next tab in order to see other variables.}\)

6.5.9 Street

Street: Type of road access to property.

  • Grvl Gravel
  • Pave Paved

Can be made into ordered predictor with Paved>Gravel.

discreteggplot("Street")

\(\textbf{Pick the next tab in order to see other variables.}\)

6.5.10 LandSlope

LandSlope: Slope of property.

  • Gtl Gentle slope
  • Mod Moderate Slope
  • Sev Severe Slope

Made into ordered with Gtl>Mod>Sev.

discreteggplot("LandSlope")

\(\textbf{Pick the next tab in order to see other variables.}\)

6.5.11 Heating quality, Central Air, Electrical

HeatingQC: Heating quality and condition.

  • Ex Excellent
  • Gd Good
  • TA Average/Typical
  • Fa Fair
  • Po Poor

Heating quality, central air, and electrical will all be changed into ordered.

discreteggplot("HeatingQC")

CentralAir: Central air conditioning.

  • N No
  • Y Yes
discreteggplot("CentralAir")

Electrical: Electrical system. Since Electrical is dominated by level of Circuit breaker the NA value will be replaced with this value in the feature engineering section.

  • SBrkr Standard Circuit Breakers & Romex
  • FuseA Fuse Box over 60 AMP and all Romex wiring (Average)
  • FuseF 60 AMP Fuse Box and mostly Romex wiring (Fair)
  • FuseP 60 AMP Fuse Box and mostly knob & tube wiring (poor)
  • Mix Mixed
discreteggplot("Electrical")

\(\textbf{Pick the next tab in order to see other variables.}\)

6.5.12 Masonry Veneer Type

MasVnrType: Masonry veneer type.

  • BrkCmn Brick Common
  • BrkFace Brick Face
  • CBlock Cinder Block
  • None None
  • Stone Stone

Will be made into an ordered predictor with the sequential order of None<Brick Common<Brick Face<Stone. Cinder block is not in the data set so it will not be included, not sure where it would rank.

discreteggplot("MasVnrType")

\(\textbf{Pick the next tab in order to see other variables.}\)

6.5.13 Basement Exposure

BsmtExposure: Refers to walkout or garden level walls.

  • Gd Good Exposure
  • Av Average Exposure (split levels or foyers typically score average or above)
  • Mn Mimimum Exposure
  • No No Exposure
  • NA No Basement

Can be made into ordered based on level of exposure.

discreteggplot("BsmtExposure")

\(\textbf{Pick the next tab in order to see other variables.}\)

6.5.14 Garage Finish

GarageFinish: Interior finish of the garage.

  • Fin Finished
  • RFn Rough Finished
  • Unf Unfinished
  • NA No Garage

Garage finish will be made into ordered.

discreteggplot("GarageFinish")

\(\textbf{Pick the next tab in order to see other variables.}\)

6.5.15 Paved Drive

PavedDrive: Paved driveway.

  • Y Paved
  • P Partial Pavement
  • N Dirt/Gravel
discreteggplot("PavedDrive")

\(\textbf{Pick the next tab in order to see other variables.}\)

6.5.16 Fence

Fence: Fence quality.

  • GdPrv Good Privacy
  • MnPrv Minimum Privacy
  • GdWo Good Wood
  • MnWw Minimum Wood/Wire
  • NA No Fence

It appears having no fence translates to a higher sales price than having a fence. Will convert this predictor to ordered with having no fence having the highest median sales price.

discreteggplot("Fence")

6.6 Categorical Predictors

6.6.1 MSSubClass

MSSubClass: Identifies the type of dwelling involved in the sale.

  • 20 1-STORY 1946 & NEWER ALL STYLES
  • 30 1-STORY 1945 & OLDER
  • 40 1-STORY W/FINISHED ATTIC ALL AGES
  • 45 1-1/2 STORY - UNFINISHED ALL AGES
  • 50 1-1/2 STORY FINISHED ALL AGES
  • etc (11 more labels)

The type of dwelling is currently a numeric predictor, it should be a factor.

discreteggplot("MSSubClass")

\(\textbf{Pick the next tab in order to see other variables.}\)

6.6.2 MSZoning

MSZoning: Identifies the general zoning classification of the sale.

  • A Agriculture
  • C Commercial
  • FV Floating Village Residential
  • I Industrial
  • RH Residential High Density
  • RL Residential Low Density
  • RP Residential Low Density Park
  • RM Residential Medium Density
discreteggplot("MSZoning")

\(\textbf{Pick the next tab in order to see other variables.}\)

6.6.3 Alley

Alley: Type of alley access to property.

  • Grvl Gravel
  • Pave Paved
  • NA No alley access
discreteggplot("Alley")

\(\textbf{Pick the next tab in order to see other variables.}\)

6.6.4 Lotshape

LotShape: General shape of property.

  • Reg Regular
  • IR1 Slightly irregular
  • IR2 Moderately Irregular
  • IR3 Irregular

Could be an ordered predictor, box-plot doesn’t support this claim though so leaving as character.

discreteggplot("LotShape")

\(\textbf{Pick the next tab in order to see other variables.}\)

6.6.5 Land Contour

LandContour: Flatness of the property.

  • Lvl Near Flat/Level
  • Bnk Banked - Quick and significant rise from street grade to building
  • HLS Hillside - Significant slope from side to side
  • Low Depression
discreteggplot("LandContour")

\(\textbf{Pick the next tab in order to see other variables.}\)

6.6.6 Lot Config.

LotConfig: Lot configuration.

  • Inside Inside lot
  • Corner Corner lot
  • CulDSac Cul-de-sac
  • FR2 Frontage on 2 sides of property
  • FR3 Frontage on 3 sides of property
discreteggplot("LotConfig")

\(\textbf{Pick the next tab in order to see other variables.}\)

6.6.7 Neighborhood

Neighborhood: Physical locations within Ames city limits.

  • Blmngtn Bloomington Heights
  • Blueste Bluestem
  • BrDale Briardale
  • BrkSide Brookside
  • etc (21 other labels)

There exists twenty-five possible labels for neighborhood.


discreteggplot("Neighborhood")

train %>% group_by(Neighborhood) %>% summarize(count=n(),avgSalePrice=mean(SalePrice)) %>% arrange(desc(count))
## # A tibble: 25 × 3
##    Neighborhood count avgSalePrice
##    <chr>        <int>        <dbl>
##  1 NAmes          225      145847.
##  2 CollgCr        150      197966.
##  3 OldTown        113      128225.
##  4 Edwards        100      128220.
##  5 Somerst         86      225380.
##  6 Gilbert         79      192855.
##  7 NridgHt         77      316271.
##  8 Sawyer          74      136793.
##  9 NWAmes          73      189050.
## 10 SawyerW         59      186556.
## # … with 15 more rows

\(\textbf{Pick the next tab in order to see other variables.}\)

6.6.8 Condition1 and Condition2

Condition1: Proximity to various conditions.

  • Artery Adjacent to arterial street
  • Feedr Adjacent to feeder street
  • Norm Normal
  • RRNn Within 200’ of North-South Railroad
  • RRAn Adjacent to North-South Railroad
  • PosN Near positive off-site feature–park, greenbelt, etc.
  • PosA Adjacent to postive off-site feature
  • RRNe Within 200’ of East-West Railroad
  • RRAe Adjacent to East-West Railroad
discreteggplot("Condition1")

table(train$Condition1)
## 
## Artery  Feedr   Norm   PosA   PosN   RRAe   RRAn   RRNe   RRNn 
##     48     81   1260      8     19     11     26      2      5

Condition2: Proximity to various conditions (if more than one is present).

  • Artery Adjacent to arterial street
  • Feedr Adjacent to feeder street
  • Norm Normal
  • RRNn Within 200’ of North-South Railroad
  • RRAn Adjacent to North-South Railroad
  • PosN Near positive off-site feature–park, greenbelt, etc.
  • PosA Adjacent to postive off-site feature
  • RRNe Within 200’ of East-West Railroad
  • RRAe Adjacent to East-West Railroad

87% of the houses have the same condition1 and condition2 value, one of the columns will likely be dropped.

discreteggplot("Condition2")

round(sum((train$Condition1==train$Condition2)/nrow(train)),2)
## [1] 0.87

\(\textbf{Pick the next tab in order to see other variables.}\)

6.6.9 Building Type and House Style

BldgType: Type of dwelling.

  • 1Fam Single-family Detached
  • 2FmCon Two-family Conversion; originally built as one-family dwelling
  • Duplx Duplex
  • TwnhsE Townhouse End Unit
  • TwnhsI Townhouse Inside Unit
discreteggplot("BldgType")

HouseStyle: Style of dwelling.

  • 1Story One story
  • 1.5Fin One and one-half story: 2nd level finished
  • 1.5Unf One and one-half story: 2nd level unfinished
  • 2Story Two story
  • 2.5Fin Two and one-half story: 2nd level finished
  • 2.5Unf Two and one-half story: 2nd level unfinished
  • SFoyer Split Foyer
  • SLvl Split Level

House Style has some rare labels that may require grouping.

discreteggplot("HouseStyle")

\(\textbf{Pick the next tab in order to see other variables.}\)

6.6.10 Roof Style and Material

RoofStyle: Type of roof.

  • Flat Flat
  • Gable Gable
  • Gambrel Gabrel (Barn)
  • Hip Hip
  • Mansard Mansard
  • Shed Shed
discreteggplot("RoofStyle")

RoofMatl: Roof material.

  • ClyTile Clay or Tile
  • CompShg Standard (Composite) Shingle
  • Membran Membrane
  • Metal Metal
  • Roll Roll
  • Tar&Grv Gravel & Tar
  • WdShake Wood Shakes
  • WdShngl Wood Shingles

Roof material has four levels that only occur once. These levels will either need to be dropped or grouped.

discreteggplot("RoofMatl")

table(train$RoofMatl)
## 
## ClyTile CompShg Membran   Metal    Roll Tar&Grv WdShake WdShngl 
##       1    1434       1       1       1      11       5       6

\(\textbf{Pick the next tab in order to see other variables.}\)

6.6.11 Exterior1st and 2nd

Exterior1st: Exterior covering on house.

  • AsbShng Asbestos Shingles
  • AsphShn Asphalt Shingles
  • BrkComm Brick Common
  • BrkFace Brick Face
  • CBlock Cinder Block
  • etc (12 other levels)
exterior1stBar <- ggplot(train,aes(x=Exterior1st))+
  geom_bar()+
  theme(axis.text.x = element_text(angle = 45))
  

exterior1stBox <- ggplot(train[!is.na(train$SalePrice),],aes(x=Exterior1st,y=SalePrice))+
  geom_boxplot()+
  theme(axis.text.x = element_text(angle = 45))

grid.arrange(exterior1stBar,exterior1stBox, ncol=2)

Exterior2nd: Exterior covering on house (if more than one material).

  • AsbShng Asbestos Shingles
  • AsphShn Asphalt Shingles
  • BrkComm Brick Common
  • BrkFace Brick Face
  • CBlock Cinder Block
  • etc (12 other levels)

Most of the time exterior1st is the same as exterior2nd, one of the predictors will likely be dropped.

exterior2ndBar <- ggplot(train,aes(x=Exterior2nd))+
  geom_bar()+
  theme(axis.text.x = element_text(angle = 45))
  

exterior2ndBox <- ggplot(train[!is.na(train$SalePrice),],aes(x=Exterior2nd,y=SalePrice))+
  geom_boxplot()+
  theme(axis.text.x = element_text(angle = 45))

grid.arrange(exterior2ndBar,exterior2ndBox, ncol=2)

round(sum(train$Exterior1st==train$Exterior2nd)/nrow(train),2)
## [1] 0.85

\(\textbf{Pick the next tab in order to see other variables.}\)

6.6.12 Foundation

Foundation: Type of foundation.

  • BrkTil Brick & Tile
  • CBlock Cinder Block
  • PConc Poured Contrete
  • Slab Slab
  • Stone Stone
  • Wood Wood
discreteggplot("Foundation")

\(\textbf{Pick the next tab in order to see other variables.}\)

6.6.13 BsmtFinType 1 and 2

BsmtFinType1: Rating of basement finished area.

  • GLQ Good Living Quarters
  • ALQ Average Living Quarters
  • BLQ Below Average Living Quarters
  • Rec Average Rec Room
  • LwQ Low Quality
  • Unf Unfinshed
  • NA No Basement
discreteggplot("BsmtFinType1")

BsmtFinType2: Rating of basement finished area (if multiple types).

  • GLQ Good Living Quarters
  • ALQ Average Living Quarters
  • BLQ Below Average Living Quarters
  • Rec Average Rec Room
  • LwQ Low Quality
  • Unf Unfinished
  • NA No Basement

From the plots it is shown that the sale price did not increase as would be expected when moving from no basement to good living quarters. Because of this, these two predictors will remain character instead of being made into ordered. Almost all of basement finish type2 are of the unfinished class.

discreteggplot("BsmtFinType2")

\(\textbf{Pick the next tab in order to see other variables.}\)

6.6.14 Heating

Heating: Type of heating.

  • Floor Floor Furnace
  • GasA Gas forced warm air furnace
  • GasW Gas hot water or steam heat
  • Grav Gravity furnace
  • OthW Hot water or steam heat other than gas
  • Wall Wall furnace

Nearly all houses have gas forced heating.

discreteggplot("Heating")

table(train$Heating)
## 
## Floor  GasA  GasW  Grav  OthW  Wall 
##     1  1428    18     7     2     4

\(\textbf{Pick the next tab in order to see other variables.}\)

6.6.15 Functional

Functional: Home functionality (Assume typical unless deductions are warranted).

  • Typ Typical Functionality
  • Min1 Minor Deductions 1
  • Min2 Minor Deductions 2
  • Mod Moderate Deductions
  • Maj1 Major Deductions 1
  • Maj2 Major Deductions 2
  • Sev Severely Damaged
  • Sal Salvage only

Nearly all values are typical.

discreteggplot("Functional")

\(\textbf{Pick the next tab in order to see other variables.}\)

6.6.16 Garage Type

GarageType: Garage location.

  • 2Types More than one type of garage
  • Attchd Attached to home
  • Basment Basement Garage
  • BuiltIn Built-In (Garage part of house - typically has room above garage)
  • CarPort Car Port
  • Detchd Detached from home
  • NA No Garage
discreteggplot("GarageType")

\(\textbf{Pick the next tab in order to see other variables.}\)

6.6.17 Misc Feature

MiscFeature: Miscellaneous feature not covered in other categories.

  • Elev Elevator
  • Gar2 2nd Garage (if not described in garage section)
  • Othr Other
  • Shed Shed (over 100 SF)
  • TenC Tennis Court
  • NA None

96% of houses do not have a miscellaneous feature.

discreteggplot("MiscFeature")

round(sum(train$MiscFeature=="none")/nrow(train),2)
## [1] 0.96

\(\textbf{Pick the next tab in order to see other variables.}\)

6.6.18 Sale Type and Condition

SaleType: Type of sale.

  • WD Warranty Deed - Conventional
  • CWD Warranty Deed - Cash
  • VWD Warranty Deed - VA Loan
  • New Home just constructed and sold
  • COD Court Officer Deed/Estate
  • Con Contract 15% Down payment regular terms
  • ConLw Contract Low Down payment and low interest
  • ConLI Contract Low Interest
  • ConLD Contract Low Down
  • Oth Other
discreteggplot("SaleType")

SaleCondition: Condition of sale.

  • Normal Normal Sale
  • Abnorml Abnormal Sale - trade, foreclosure, short sale
  • AdjLand Adjoining Land Purchase
  • Alloca Allocation - two linked properties with separate deeds, typically condo with a garage unit
  • Family Sale between family members
  • Partial Home was not completed when last assessed (associated with New Homes)

Both are dominated by one label and have a number of rare labels.

discreteggplot("SaleCondition")

7 Test Data Analysis

MasVnrArea and MasVnrType:

In kaggle competitions we have access to the test set so this allows us to take a peak for any possible observations with data entry errors or weird values. The masonry veneer area and type exhibit the same problems that occurred in the train set.

test %>% filter(MasVnrArea!=0 & MasVnrType=="None") %>% select(Id,MasVnrArea,MasVnrType)
##     Id MasVnrArea MasVnrType
## 1 1670        285       None
## 2 2453          1       None
test %>% filter(MasVnrArea==0 & MasVnrType!="None") %>% select(Id,MasVnrArea,MasVnrType)
##     Id MasVnrArea MasVnrType
## 1 2320          0    BrkFace

8 Baseline Model

Now that the data analysis portion is over a random forest model will be run to develop a baseline score in which we hope to beat.

Train Control:

custom_summary = function(data, lev = NULL, model = NULL) {
library(Metrics)
out = rmsle(data[, "obs"], data[, "pred"])
names(out) = c("rmsle")
out
}
ctrl <- trainControl(method="cv",
                     number=10,
                     summaryFunction = custom_summary,
                     allowParallel = TRUE)

Removing near zero variances.

noVariance <- nearZeroVar(train)
noNearZeroVar <- train[,-noVariance]

Random Forest Model:

set.seed(123)
rfBase <- train(SalePrice~.,data=noNearZeroVar[complete.cases(train),-1],
                trControl=ctrl,
                metric="rmsle",
                maximize=FALSE,
                tuneGrid=expand.grid(mtry=seq(20,40,by=5)))

saveRDS(rfBase,"rfBase.rds")
rfBase <- readRDS("rfBase.rds")
min(rfBase$results$rmsle)
## [1] 0.1394914

Variable Importance:

From the baseline random forest model we are also able to gather the variable importance in predicting sales price. The top 10 listed important variables are overall quality, above ground living area, garage cars, external quality, total basement square feet, kitchen quality, 1st floor square feet, garage area, year built, and basement quality.

varImp(rfBase)
## rf variable importance
## 
##   only 20 most important variables shown (out of 193)
## 
##                 Overall
## OverallQual     100.000
## GrLivArea        50.372
## GarageCars       36.472
## ExterQualTA      32.607
## TotalBsmtSF      32.597
## YearBuilt        30.442
## X1stFlrSF        24.929
## GarageArea       22.189
## X2ndFlrSF        16.169
## BsmtFinSF1       15.273
## FullBath         11.267
## LotArea          10.280
## GarageYrBlt       9.800
## TotRmsAbvGrd      9.520
## YearRemodAdd      7.377
## MasVnrArea        6.168
## LotFrontage       5.326
## FoundationPConc   5.229
## KitchenQualTA     4.664
## ExterQualGd       4.451

9 Feature Engineering

Takeaways from the data analysis section:

Mode function:

mode <- function(x){
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

MasVnrArea and MasVnrType:

Masonry veneer area and type have errors such as an area of zero when a type is listed or an area above zero when listed as none. These issues will fixed by taking the median values for the masonry veneer area and the second most common masonry veneer type which is brick face for masonry veneer type.

train[train$Id==689,"MasVnrArea"] <- median(train[train$MasVnrType=="BrkFace" & !is.na(train$MasVnrArea),"MasVnrArea"])

train[train$Id==1242,"MasVnrArea"] <- median(train[train$MasVnrType=="Stone" & !is.na(train$MasVnrArea),"MasVnrArea"])

train[train$Id%in%c(774,1231),"MasVnrArea"] <- 0

train[train$Id %in% c(625,1301,1335) & train$MasVnrType=="None","MasVnrType"] <- "BrkFace"

Applying the same to the test set.

test[test$Id==2453,"MasVnrArea"] <- 0
test[test$Id==1670,"MasVnrType"] <- "BrkFace"
test[test$Id==2320,"MasVnrArea"] <- median(train[train$MasVnrType=="BrkFace" & !is.na(train$MasVnrArea),"MasVnrArea"])

Imputation:

For Electrical and MasVnrType mode imputation will be used because there are only a total on nine NA values between the two. Any changes applied to the train data set should also be applied to the test.

test[is.na(test$Electrical),"Electrical"] <- mode(train[!is.na(train$Electrical),"Electrical"])

test[is.na(test$MasVnrType),"MasVnrType"] <- mode(train[!is.na(train$MasVnrType),"MasVnrType"])
train[is.na(train$Electrical),"Electrical"] <- mode(train[!is.na(train$Electrical),"Electrical"])

train[is.na(train$MasVnrType),"MasVnrType"] <- mode(train[!is.na(train$MasVnrType),"MasVnrType"])

For MasVnrArea there are only eight NA values, median imputation will be used.

test[is.na(test$MasVnrArea),"MasVnrArea"] <- 
  median(train[!is.na(train$MasVnrArea),"MasVnrArea"])

train[is.na(train$MasVnrArea),"MasVnrArea"] <- median(train[!is.na(train$MasVnrArea),"MasVnrArea"])

KNN imputation will be used for lot frontage. Since KNN imputation center and scales the data the reverse of the operations will be applied to change lot frontage back to its original form.

library(RANN)
lotFrontageMean <- mean(train[!is.na(train$LotFrontage),"LotFrontage"])
sdLotFrontage <- sd(train[!is.na(train$LotFrontage),"LotFrontage"])

imputeMissing <- preProcess(train[,c(-1,-81)],"knnImpute") # 1=ID, 81=SalePrice
imputedMissingTrain <- predict(imputeMissing,train)

# unstandardizing the data
train$LotFrontage <- (imputedMissingTrain$LotFrontage*sdLotFrontage)+lotFrontageMean

Same imputation method will be used for the test set.

imputedMissingTest <- predict(imputeMissing,test)
# unstandardizing the data
test$LotFrontage <- (imputedMissingTest$LotFrontage*sdLotFrontage)+lotFrontageMean

Garage year built has intrinsic missing values which represent not having a garage. Since it is highly correlated with the year the house was built, imputation will not be needed as it will be dropped later on.

Test NAs:

There are remaining NA values in the test set. Looking at the basement and garage NA’s it appears those observations did not have a basement or garage so the NA values are replaced with zeros. For the remaining NA values mode imputation is used. Again we are skipping the year the garage was built because that variable will be dropped.

testNAs <- as.data.frame(sapply(test,function(x) sum(is.na(x))))
names(testNAs) <- "NA_Count"
testNAs$variable <- row.names(testNAs)
testNAs %>% arrange(desc(NA_Count)) %>% filter(NA_Count>0)
##              NA_Count     variable
## GarageYrBlt        78  GarageYrBlt
## MSZoning            4     MSZoning
## Utilities           2    Utilities
## BsmtFullBath        2 BsmtFullBath
## BsmtHalfBath        2 BsmtHalfBath
## Functional          2   Functional
## Exterior1st         1  Exterior1st
## Exterior2nd         1  Exterior2nd
## BsmtFinSF1          1   BsmtFinSF1
## BsmtFinSF2          1   BsmtFinSF2
## BsmtUnfSF           1    BsmtUnfSF
## TotalBsmtSF         1  TotalBsmtSF
## KitchenQual         1  KitchenQual
## GarageCars          1   GarageCars
## GarageArea          1   GarageArea
## SaleType            1     SaleType
test[,grepl("Bsmt",names(test))] %>% filter(is.na(BsmtFullBath))
##   BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## 1     none     none         none         none         NA         none
## 2     none     none         none         none          0         none
##   BsmtFinSF2 BsmtUnfSF TotalBsmtSF BsmtFullBath BsmtHalfBath
## 1         NA        NA          NA           NA           NA
## 2          0         0           0           NA           NA
test[test$Id==2121,c("BsmtFinSF1","BsmtFinSF2","BsmtUnfSF","TotalBsmtSF","BsmtFullBath","BsmtHalfBath")] <- 0
test[test$Id==2189,c("BsmtFullBath","BsmtHalfBath")] <- 0

test[,grepl("Garage",names(test))] %>% filter(is.na(GarageArea))
##   GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual
## 1     Detchd          NA         none         NA         NA       none
##   GarageCond
## 1       none
test[test$Id==2577,c("GarageCars","GarageArea")] <- 0

test[is.na(test$MSZoning),"MSZoning"] <- mode(train$MSZoning)
test[is.na(test$Utilities),"Utilities"] <- mode(train$Utilities)
test[is.na(test$Functional),"Functional"] <- mode(train$Functional)
test[is.na(test$Exterior1st),"Exterior1st"] <- mode(train$Exterior1st)
test[is.na(test$Exterior2nd),"Exterior2nd"] <- mode(train$Exterior2nd)
test[is.na(test$KitchenQual),"KitchenQual"] <- mode(train$KitchenQual)
test[is.na(test$SaleType),"SaleType"] <- mode(train$SaleType)

Adding Predictors:

The predictors added to our data will be whether or not a house had a garage built after the house was built, total rooms that are neither bathrooms or kitchens, total number of bathrooms, whether a house has a porch, total porch sf, and whether or not a house had a remodel.

train <- train %>% 
  mutate(newGarage=ifelse(!is.na(GarageYrBlt) & YearBuilt!=GarageYrBlt ,1,0),
         otherRooms=TotRmsAbvGrd-FullBath-HalfBath,
         totalBathRoom=FullBath+HalfBath+BsmtFullBath+BsmtHalfBath,
         hasPorch=ifelse(WoodDeckSF+OpenPorchSF+EnclosedPorch+X3SsnPorch>0,1,0),
         totalPorchSF=WoodDeckSF+OpenPorchSF+EnclosedPorch+X3SsnPorch,
         hasRemodel=ifelse(YearBuilt!=YearRemodAdd,1,0))

test <- test %>% 
  mutate(newGarage=ifelse(!is.na(GarageYrBlt) & YearBuilt!=GarageYrBlt ,1,0),
         otherRooms=TotRmsAbvGrd-FullBath-HalfBath,
         totalBathRoom=FullBath+HalfBath+BsmtFullBath+BsmtHalfBath,
         hasPorch=ifelse(WoodDeckSF+OpenPorchSF+EnclosedPorch+X3SsnPorch>0,1,0),
         totalPorchSF=WoodDeckSF+OpenPorchSF+EnclosedPorch+X3SsnPorch,
         hasRemodel=ifelse(YearBuilt!=YearRemodAdd,1,0))

When looking at the year the house was built compared to the sales price it appears before a certain period the age of the house does not matter. This point is found using a mars model. Houses built before 1971 have less affect for a one year increase in the age of the house on sale price compared to a house built after 1971. A newly created variable of ‘OldHouse’ will be used to address this.

# Building mars model to find hinge point
yearBuiltBinned <- earth(SalePrice~YearBuilt,train[,-1],nprune = 2)
#building data frame of predicted values from the mars model
predictedValues <- data.frame(YearBuilt=train["YearBuilt"],predict(yearBuiltBinned,train))

#plotting what it looks like with the hinge function
predictedValues %>% 
  ggplot(.,aes(YearBuilt,SalePrice))+
  geom_smooth(color="red",size=3)+
  geom_point(data=train,aes(x=YearBuilt,y=SalePrice))+
  geom_point()+
  annotate("segment",x=1971,xend=1971,y=0,yend=350000,color="red")+
  labs(title="Hinge at 1971")

#adding the binary variable
train$OldHouse <- ifelse(train$YearBuilt<=1971,1,0)
test$OldHouse <- ifelse(test$YearBuilt<=1971,1,0)

Near-Zero Variance Predictors:

Near-zero variance predictors enter extra variance into our model while supplying little information. The predictors in this first iteration of feature engineering that will be dropped will be chosen very conservatively. Only the predictors of Street, Utilities, Heating, and PoolQC will be dropped.

noVariance <- nearZeroVar(train,freqCut = 40,uniqueCut = .5,saveMetrics = TRUE)
noVariance %>% filter(nzv==TRUE)
##            freqRatio percentUnique zeroVar  nzv
## Street     242.33333     0.1369863   FALSE TRUE
## Utilities 1459.00000     0.1369863   FALSE TRUE
## Heating     79.33333     0.4109589   FALSE TRUE
## PoolQC     484.33333     0.2739726   FALSE TRUE
train <- train %>% select(-Street,-Utilities,-Heating,-PoolQC)
test <- test %>% select(-Street,-Utilities,-Heating,-PoolQC)

Correlated Predictors:

The highly correlated numeric predictors we found that will be dropped are total rooms above ground, first floor sf, garage area, the year the garage was built, and WoodDeckSF.

numericPreds <- train[,c(-1,-77)] %>% select_if(is.numeric)
correlations <- as.data.frame(as.table(cor(numericPreds,use="complete.obs"))) %>% filter(Freq!=1) %>% arrange(desc(Freq))
correlations <- correlations[seq(1,nrow(correlations),by=2),]
head(correlations)
##            Var1         Var2      Freq
## 1    otherRooms TotRmsAbvGrd 0.8700782
## 3    GarageArea   GarageCars 0.8314807
## 5   GarageYrBlt    YearBuilt 0.8256675
## 7     X1stFlrSF  TotalBsmtSF 0.8224694
## 9  TotRmsAbvGrd    GrLivArea 0.8209748
## 11 totalPorchSF   WoodDeckSF 0.8099937
train <- train %>% select(-TotRmsAbvGrd,-X1stFlrSF,-GarageArea,-GarageYrBlt,-WoodDeckSF)
test <- test %>% select(-TotRmsAbvGrd,-X1stFlrSF,-GarageArea,-GarageYrBlt,-WoodDeckSF)

From the data analysis there are some categorical predictors that have nearly all the same labels. These variables will be dropped now.

train <- train %>% select(-Condition2,-Exterior2nd,-BsmtFinType2)
test <- test %>% select(-Condition2,-Exterior2nd,-BsmtFinType2)

Ordered Predictors:

Many of the predictors can be represented as ordered predictors. This will lower the amount of factor levels which will improve computation time and can lower the chance of over fitting.

noNone <- c('Po' = 0, 'Fa' = 1, 'TA' = 2, 'Gd' = 3, 'Ex' = 4)
withNone <-c('none' = 0, 'Po' = 1, 'Fa' = 2, 'TA' = 3, 'Gd' = 4, 'Ex' = 5)

train$ExterQual<-as.integer(revalue(train$ExterQual, noNone))
train$ExterCond<-as.integer(revalue(train$ExterCond, noNone))
train$KitchenQual <- as.integer(revalue(train$KitchenQual,noNone))
train$HeatingQC <- as.integer(revalue(train$HeatingQC,noNone))

train$BsmtQual <- as.integer(revalue(train$BsmtQual,withNone))
train$BsmtCond <- as.integer(revalue(train$BsmtCond,withNone))
train$FireplaceQu <- as.integer(revalue(train$FireplaceQu,withNone))
train$GarageQual <- as.integer(revalue(train$GarageQual,withNone))
train$GarageCond <- as.integer(revalue(train$GarageCond,withNone))
test$ExterQual<-as.integer(revalue(test$ExterQual, noNone))
test$ExterCond<-as.integer(revalue(test$ExterCond, noNone))
test$KitchenQual <- as.integer(revalue(test$KitchenQual,noNone))
test$HeatingQC <- as.integer(revalue(test$HeatingQC,noNone))

test$BsmtQual <- as.integer(revalue(test$BsmtQual,withNone))
test$BsmtCond <- as.integer(revalue(test$BsmtCond,withNone))
test$FireplaceQu <- as.integer(revalue(test$FireplaceQu,withNone))
test$GarageQual <- as.integer(revalue(test$GarageQual,withNone))
test$GarageCond <- as.integer(revalue(test$GarageCond,withNone))
train$LandSlope <- as.integer(revalue(train$LandSlope,c("Gtl"=0,"Mod"=1,"Sev"=2)))
train$CentralAir <- as.integer(revalue(train$CentralAir,c('N'=0,'Y'=1)))
train$Electrical <- as.integer(revalue(train$Electrical,c('Mix'=0,'FuseP'=1,'FuseF'=2,'FuseA'=3,'SBrkr'=4)))
train$MasVnrType <- as.integer(revalue(train$MasVnrType,c('None'=0,'BrkCmn'=1,'BrkFace'=2,'Stone'=3)))
train$BsmtExposure <- as.integer(revalue(train$BsmtExposure,c('none'=0,'No'=1,'Mn'=2,'Av'=3,'Gd'=4)))
train$GarageFinish <- as.integer(revalue(train$GarageFinish,c('none'=0,'Unf'=1,'RFn'=2,'Fin'=3)))
train$PavedDrive <- as.integer(revalue(train$PavedDrive,c('N'=0,'P'=1,'Y'=2)))
train$Fence <- as.integer(revalue(train$Fence,c('MnWw'=0,'GdWo'=1,'MnPrv'=2,'GdPrv'=3,'none'=4)))
test$LandSlope <- as.integer(revalue(test$LandSlope,c("Gtl"=0,"Mod"=1,"Sev"=2)))
test$CentralAir <- as.integer(revalue(test$CentralAir,c('N'=0,'Y'=1)))
test$Electrical <- as.integer(revalue(test$Electrical,c('Mix'=0,'FuseP'=1,'FuseF'=2,'FuseA'=3,'SBrkr'=4)))
test$MasVnrType <- as.integer(revalue(test$MasVnrType,c('None'=0,'BrkCmn'=1,'BrkFace'=2,'Stone'=3)))
test$BsmtExposure <- as.integer(revalue(test$BsmtExposure,c('none'=0,'No'=1,'Mn'=2,'Av'=3,'Gd'=4)))
test$GarageFinish <- as.integer(revalue(test$GarageFinish,c('none'=0,'Unf'=1,'RFn'=2,'Fin'=3)))
test$PavedDrive <- as.integer(revalue(test$PavedDrive,c('N'=0,'P'=1,'Y'=2)))
test$Fence <- as.integer(revalue(test$Fence,c('MnWw'=0,'GdWo'=1,'MnPrv'=2,'GdPrv'=3,'none'=4)))

Grouping Rare Levels:

Any levels with less than fifteen occurrences in the train set will be grouped. Most of these will be done by replacing the value with the mode. In other cases the rare levels will be grouped with a level that is similar.

train$MSSubClass <- as.character(train$MSSubClass)
test$MSSubClass <- as.character(test$MSSubClass)

train[train$MSSubClass %in% c("40","80"),"MSSubClass"] <- mode(train$MSSubClass)
train[train$MSZoning %in% c("C (all)"),"MSZoning"] <- mode(train$MSZoning)
train[train$LotShape %in% c("IR1","IR2","IR3"),"LotShape"] <- "irregular"
train[train$LotConfig %in% c("FR2","FR3"),"LotConfig"] <- "twoPlus"
train[train$Neighborhood %in% c("Blueste","NPkVill","Veenker"),"Neighborhood"] <- mode(train$Neighborhood)
train[train$Condition1 %in% c("RRAe","RRAn","RRNe","RRNn"),"Condition1"] <- "nearRailroad"
train[train$RoofStyle %in% c("Flat","Gambrel","Mansard","Shed"),"RoofStyle"] <- mode(train$RoofStyle)
train[train$Foundation %in% c("Stone","Wood"),"Foundation"] <- mode(train$Foundation)
train[train$Functional %in% c("Maj1","Maj2","Sev"),"Functional"] <- "other"
train[train$GarageType %in% c("2Types","CarPort"),"GarageType"] <- "other"
train[train$MiscFeature %in% c("Gar2","Othr","Shed","TenC"),"MiscFeature"] <- "hasMisc"
train[train$SaleType %in% c("Con","ConLD","ConLI","ConLw","CWD","Oth"),"SaleType"] <- "other"
train[train$SaleCondition %in% c("AdjLand","Alloca","Family"),"SaleCondition"] <- "other"

Same for the test.

test[test$MSSubClass %in% c("40","80"),"MSSubClass"] <- mode(test$MSSubClass)
test[test$MSZoning %in% c("C (all)"),"MSZoning"] <- mode(test$MSZoning)
test[test$LotShape %in% c("IR1","IR2","IR3"),"LotShape"] <- "irregular"
test[test$LotConfig %in% c("FR2","FR3"),"LotConfig"] <- "twoPlus"
test[test$Neighborhood %in% c("Blueste","NPkVill","Veenker"),"Neighborhood"] <- mode(test$Neighborhood)
test[test$Condition1 %in% c("RRAe","RRAn","RRNe","RRNn"),"Condition1"] <- "nearRailroad"
test[test$RoofStyle %in% c("Flat","Gambrel","Mansard","Shed"),"RoofStyle"] <- mode(test$RoofStyle)
test[test$Foundation %in% c("Stone","Wood"),"Foundation"] <- mode(test$Foundation)
test[test$Functional %in% c("Maj1","Maj2","Sev"),"Functional"] <- "other"
test[test$GarageType %in% c("2Types","CarPort"),"GarageType"] <- "other"
test[test$MiscFeature %in% c("Gar2","Othr","Shed","TenC"),"MiscFeature"] <- "hasMisc"
test[test$SaleType %in% c("Con","ConLD","ConLI","ConLw","CWD","Oth"),"SaleType"] <- "other"
test[test$SaleCondition %in% c("AdjLand","Alloca","Family"),"SaleCondition"] <- "other"

Month sold should be a string instead of an integer because there is no clear ordering to it.

train$MoSold <- as.character(train$MoSold)
test$MoSold <- as.character(test$MoSold)

Removing Outlier:

Observation 1299 of the train data kept appearing as an outlier so will be removed from the data set.

train <- train[-1299,]

10 Modeling

Another random forest model will be run to see if the results beat our baseline model. The random forest model requires little to no pre-processing.

set.seed(12334)
rfMod <- train(SalePrice~.,data=train[,-1],#1=Id
                trControl=ctrl,
                metric="rmsle",
                maximize=FALSE,
                tuneGrid=expand.grid(mtry=seq(20,40,by=5)))

saveRDS(rfMod,"rfMod.rds")

The changes made created a slight improvement in our results.

rfMod <- readRDS("rfMod.rds")
min(rfMod$results$rmsle)
## [1] 0.1354918
data.frame(model=c("baseline","rfModel"),rmsle=c(min(rfBase$results$rmsle),min(rfMod$results$rmsle)))
##      model     rmsle
## 1 baseline 0.1394914
## 2  rfModel 0.1354918

The following models will require us to make changes to the data such as standardization and dummy variables.

Standardization:

makeStandard <- preProcess(train[,c(-1,-69)],method=c("center","scale"))#1=Id,69=SalePrice

standardTrain <- predict(makeStandard,train)

Dummy Variables:

After creating dummy variables and dropping id there are a total of 157 variables, with one of them being the dependent variable sales price.

for(name in names(standardTrain)){
  if(is.character(train[,name])){
    train[,name] <- as.factor(train[,name])
  }
}
makeDummy <- dummyVars(~.,standardTrain,fullRank = TRUE)
trainClean <- predict(makeDummy,standardTrain)

Lasso:

set.seed(342)
lassoMod <- train(SalePrice~.,data=trainClean[,-1],#1=Id
                  method="lasso",
                  trControl=ctrl,
                  tuneGrid=expand.grid(fraction=seq(.05,.8,by=.05)),
                  metric="rmsle",
                  maximize=FALSE)

saveRDS(lassoMod,"lassoMod.rds")
lassoMod <- readRDS("lassoMod.rds")
min(lassoMod$results$rmsle)
## [1] 0.1472279

Cubist Mod:

set.seed(1234)
cubistMod <- train(SalePrice~.,data=trainClean[,-1],#1=Id
                   method="cubist",
                   trControl=ctrl,
                   metric="rmsle",
                   maximize=FALSE,
                   tuneGrid=expand.grid(committees=seq(70,100,by=5),neighbors=c(7,8,9)))
saveRDS(cubistMod,"cubistMod.rds")
cubistMod <- readRDS("cubistMod.rds")
min(cubistMod$results$rmsle)
## [1] 0.1220886

Round 1 Results:

The cubist model performed the best of the three chosen models in the first round of modeling.

modelResults <- resamples(list(lasso=lassoMod,
                               cubist=cubistMod,
                               randomForest=rfMod))
bwplot(modelResults,metric="rmsle")


11 Feature Engineering 2

Near Zero Variance:

When near zero variance predictors were removed before the parameters for what define a near zero variance predictor were very strict. Now the default parameters will be used. This drops an additional fifteen variables.

test <- test[,-nearZeroVar(train)]
train <- train[,-nearZeroVar(train)]
train$OverallCondSQ <- train$OverallQual^2
train$GrLivAreaSQ <- train$GrLivArea^2
train$GarageCarsSQ <- train$GarageCars^2
train$TotalBsmtSFSQ <- train$TotalBsmtSF^2
train$ExterQualSQ <- train$ExterQual^2

test$OverallCondSQ <- test$OverallQual^2
test$GrLivAreaSQ <- test$GrLivArea^2
test$GarageCarsSQ <- test$GarageCars^2
test$TotalBsmtSFSQ <- test$TotalBsmtSF^2
test$ExterQualSQ <- test$ExterQual^2

Standardizing Data:

The numeric predictors will be standardized again.

SalePrice <- train$SalePrice
train <- train %>% select(-SalePrice)

makeStandard <- preProcess(train,method=c("center","scale"))

standardTrain <- predict(makeStandard,train)
standardTest <- predict(makeStandard,test)

The train and test data set will be merged so that the dummy variables are the same across the two sets. The first level of each categorical variable will be set to the mode. This will be important for when we convert our variables into dummy variables the mode of each category will become the baseline. If we were to drop a dummy variable because of its rareness it will be the same as converting our rare variable into the mode level.

combined <- rbind(standardTrain,standardTest)

# making the mode the reference level and changing characters into factors
for(name in names(combined)){
  if(is.character(combined[,name])){
    combined[,name] <- as.factor(combined[,name])
    baseLine <- mode(combined[,name])
    combined[,name] <- relevel(combined[,name],ref=as.character(baseLine))
  }
}

makeDummy <- dummyVars(~.,data=combined,fullRank = TRUE)# id=1,saleprice=70
combinedClean <- predict(makeDummy,combined)
combinedClean <- as.data.frame(combinedClean)

trainClean <- combinedClean[1:1459,]
trainClean$SalePrice <- SalePrice

testClean <- combinedClean[1460:2918,]

For some of the more rare labels it is possible for them to only exist in the train set or only exist in the test set. If they only exist in the train set it will lead to errors when trying to make predictions on the test set. If they exist only in the test set it will lead to unnecessary variance when building our models.

# names not in the train set
for(name in names(trainClean)){
  if(sum(trainClean[,name])==0){
    print(name)
  }
}
## [1] "MSSubClass.150"
trainClean <- trainClean %>%
  select(-MSSubClass.150)

testClean <- testClean %>% 
  select(-MSSubClass.150)
# names not in the test set
for(name in names(testClean)){
  if(sum(testClean[,name])==0){
    print(name)
  }
}
## [1] "HouseStyle.2.5Fin"
## [1] "Exterior1st.ImStucc"
## [1] "Exterior1st.Stone"
trainClean <- trainClean %>% 
  select(-HouseStyle.2.5Fin,-Exterior1st.ImStucc,-Exterior1st.Stone)

testClean <- testClean %>% 
  select(-HouseStyle.2.5Fin,-Exterior1st.ImStucc,-Exterior1st.Stone)

High Correlations:

Correlated values above .9 will be removed from the train and test set.

correlatedColumns <- findCorrelation(cor(trainClean[,c(-1,-149)]),cutoff = .9)

testClean <- testClean[,-correlatedColumns]
trainClean <- trainClean[,-correlatedColumns]
trainClean$SalePrice <- log(trainClean$SalePrice)

ctrl <- trainControl(method="cv",number=10)

12 Modeling 2

Lasso 2:

set.seed(342)
lassoMod2 <- train(SalePrice~.,data=trainClean[,-1],#1=Id
                  method="lasso",
                  trControl=ctrl,
                  #metric="rmsle",
                  tuneGrid=expand.grid(fraction=seq(.2,.7,by=.05)),
                  maximize=FALSE)

saveRDS(lassoMod2,"lassoMod2.rds")
lassoMod2 <- readRDS("lassoMod2.rds")
min(lassoMod2$results$RMSE)
## [1] 0.1227373

Cubist Mod 2:

set.seed(1234)
cubistMod2 <- train(SalePrice~.,data=trainClean[,-1],
                   method="cubist",
                   trControl=ctrl,
                   #metric="rmsle",
                   maximize=FALSE,
                   tuneGrid=expand.grid(committees=seq(70,100,by=5),neighbors=c(7,8,9)))
saveRDS(cubistMod2,"cubistMod2.rds")
cubistMod2 <- readRDS("cubistMod2.rds")
min(cubistMod2$results$RMSE)
## [1] 0.120205

Results:

Both the lasso and cubist performed slightly better after the second round of feature engineering.

modelResults2 <- resamples(list(lasso=lassoMod2,
                               cubist=cubistMod2))
bwplot(modelResults,metric="rmsle")

bwplot(modelResults2,metric="RMSE")

13 Final Submission

The final model submitted was the cubist2 model. It achieved a rmsle of .12077 on the test set.

#exp to transform back to original units
outputDF <- data.frame(ID=test$Id,SalePrice=exp(predict(cubistMod2,testClean)))

write.csv(x = outputDF,file = "submission2.csv",row.names=FALSE)